In the brief time we have this morning, I hope to dispel a few
generalizations that have been propagated about APL. In addition, I 
will present a few, hopefully less untrue, generalizations of my own. 
During the session, you may detect some of my 2-bit philosophy on APL. 
This is unfortunate, and I encourage you all to yawn widely if I start
moralizing.


APL presents us with a dichotomy. It is a tool of expression 
used to define algorithms and their efficiencies; yet it is also an 
instrument of execution, a solver of real-life problems. Essentially, 
it is a language and an implementation. I was recently amused by a 
sentence in a terminal manufacturer's advertisement which states: "In 
the beginning, Iverson planned nothing so mundane as implementation of
the [APL] language, even though it had the sweet scent of interpreter 
capabilities almost from the first"; which seems to imply that either 
we are all very mundane people or Ken Iverson had a nose problem. 
Whichever, it does point out that we are using an IBM program product 
computer language, not the Ken Iverson product. We are also faced 
with an existing implementation, not a proposed, or theoretical, or 
experimental version. APL as conceived by Iverson has been born into
the world of integrated circuits, monolithic memories, and
data-processing budgets, and it is a different animal. Let us see how
we can best tame that beast.


I don't pretend to know anything but a smattering of how 
computers, or even the internals of APL, work; but I am familiar with
a few of the fundamentals, for instance storage of a variable: 


[Figure: storage layout of a variable. Recoverable labels: MISC. I.D.
AND POINTERS; DATA IN RAVELED FORM]

Given this knowledge, one could easily deduce that monadic rho
ought to be very quick as, indeed, it is. Further, one could
reasonably assume that given an array of N elements, any reshape of
that array that still contained N elements would be trivial. Now,
using a little common sense, let's see if we can decide which should
be faster. Given: A←B←⍳1000

A,[.5]B   or   A,[1.5]B
In both cases, the system will increment the rank by 1 and catenate a 
2 to the shape. The difference lies in the data handling: in the 
second case the system must alternately merge the elements of A and B 
to produce the laminate; while in the first, it simply slaps B at the 
end of A and goes back to sleep....hints as to whether a matrix
database ought to be row or column oriented.
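
The difference in data handling can be modeled outside APL. The sketch
below uses Python purely as a neutral notation; the function names are
hypothetical, and it assumes only what the talk states, namely that an
array's data is held in raveled (row-major) form. It shows the raveled
data each laminate has to produce.

```python
def laminate_first(a, b):
    """Model of A,[.5]B: a 2-row matrix whose raveled data is a, then b."""
    return list(a) + list(b)


def laminate_last(a, b):
    """Model of A,[1.5]B: a 2-column matrix whose raveled data
    alternates one element of a with one element of b."""
    merged = []
    for x, y in zip(a, b):
        merged.append(x)
        merged.append(y)
    return merged
```

The first form copies two blocks end to end; the second touches every
element pair individually, which is the merging work described above.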


In the above, I used some over-simplified knowledge of how APL 
stores data, and added my intuitive feelings about what really has to 
be done to perform the operation - I say that because I don't know how 
it is performed internally. Let's try another intuitive example.
Given: A←?20 20⍴100, which is faster?

A+.×A   or   A⌹A
If most of you are like me, you've had this gut feeling since the 
second grade that addition and multiplication are easier than 
division. So if you guessed the first was faster, you have good 
intuition, because it is. What I am trying to point out is: 


1. USE COMMON SENSE IN PROGRAMMING.

Let's take another example. To find the average of a vector A,
the common formula is (+/A)÷⍴A. Occasionally, one sees code like
+/A÷⍴A which must (?) be faster because it has fewer characters; and
some supposed hotshots take this code one step further to A+.÷⍴A. Now
I ask all you reasonable people, which is faster? If we reason the
code out for A←?1000⍴1000, we see that the first case performs 1000
adds, and one division. The second and third require 1000 divisions
plus 1000 adds. Inner product happens to be a bit nasty for boolean
and floating point data types, hence the discrepancy between 2 and 3.
Observe that APL\360 is not intelligent, and does quite literally what 
it is told to do. Note also that the parentheses are useful, even 
though they do not affect the result - in fact, you should, 


2. USE PARENTHESES TO DEFINE AND LIMIT YOUR COMPUTATIONS. 


A more suitable example would be: given 2 scalar quantities (A
and B) and a vector (⍳C), which would be more efficient?

A+B+⍳C   or   (A+B)+⍳C
The first case needs 2×C additions, while the second only requires C+1
additions. Yet even advanced APL programmers often write code like
the first to save a pair of parentheses....enough said.
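
The operation counts behind both parenthesization examples can be
checked with a small model. This is a Python sketch, not APL, and the
function names are hypothetical. It assumes strict right-to-left
evaluation with no optimization, and index origin 1. A reduction over n
items is counted as n-1 additions, so the averaging totals differ by
one from the round figures quoted in the talk.

```python
def mean_sum_first(values):
    """Model of (+/A)÷⍴A: reduce first, then a single division."""
    adds = 0
    total = values[0]
    for x in values[1:]:
        total += x
        adds += 1
    return total / len(values), adds, 1


def mean_divide_first(values):
    """Model of +/A÷⍴A: divide every element, then reduce."""
    n = len(values)
    quotients = [x / n for x in values]       # n divisions
    adds = 0
    total = quotients[0]
    for q in quotients[1:]:
        total += q
        adds += 1
    return total, adds, n


def shifted_unparenthesized(a, b, c):
    """Model of A+B+⍳C: two full vector additions."""
    inner = [b + i for i in range(1, c + 1)]  # c additions
    return [a + x for x in inner], 2 * c      # c more additions


def shifted_parenthesized(a, b, c):
    """Model of (A+B)+⍳C: one scalar addition, one vector addition."""
    base = a + b                              # 1 addition
    return [base + i for i in range(1, c + 1)], c + 1
```

Each function returns its result together with the number of scalar
operations it needed, so the two formulations can be compared on both
counts.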


While we are on the subject of generating shifted index vectors,
I might point out that A↓⍳B+A is often faster than A+⍳B if B is
sufficiently large, and A is small enough to avoid WS FULL errors.
Again, we can relate the reasons to just what the system must do
internally. If A←100 and B←1000, the first case involves a scalar add
(trivial), an index generation (trivial), and a drop (trivial); but
the second requires a slightly smaller index generation (trivial, no
significant difference), and 1000 scalar adds (not so trivial).
Essentially, 

3. KNOW YOUR ARGUMENTS. 

If you do, there is a good chance you may be able to take advantage of
a more efficient formulation.
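
To see that the two shifted-index formulations agree, here is a small
Python model (hypothetical function names, index origin 1 assumed). It
only demonstrates that the results are equal; it says nothing about
timings on any APL system.

```python
def shifted_by_drop(a, b):
    """Model of A↓⍳B+A: one scalar add, one index generation, one drop."""
    return list(range(1, a + b + 1))[a:]


def shifted_by_add(a, b):
    """Model of A+⍳B: index generation followed by b scalar additions."""
    return [a + i for i in range(1, b + 1)]
```

Note the trade-off the talk mentions: the drop form builds a temporary
vector of A+B elements, which is why a large A risks WS FULL.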


For instance, in the case of selection, we fundamentally have 3 
operators from which to choose - indexing, take/drop, and compression; 
and, although they often emulate each other in result, they have very 
different purposes and reasons for existence. As a general rule, use 
indexing to nitpick or reorder an array; use take/drop to select large 
chunks; and use compression in the obvious case of selecting on the 
basis of a true-false condition. Be cognizant of the fact that 
indexing must first determine the location of every indexed point 
(through some kind of internal base value operation), then pick every 
value individually. On the other hand, take and drop can simply set 
up a loop and zoom through until they hit (some function of) the left 
argument. I can best illustrate by first selecting a very large, then
a very small, chunk of an array A←40 40⍴100. B and C (B←C←2+⍳36) will
be row/column indices; D and E are set up (D←36 36; E←2 2) to grab the
same chunk. The benchmark indicates that the take/drop formulation is
somewhat more than seven times faster. So much for indexing, you
say? For small chunks I suppose the initial overhead required for the
setup of the take/drop loops is the killer, and indexing gets the nod.
Note that compression is competitive - and certainly better than
generating an index vector if you have a bitvector.
Again, we have seen a case where knowledge of the arguments can lead
to more efficient coding. Incidentally, for vectors, a good rule of
thumb for chunk selection is to use take/drop for greater than 20
elements, indexing otherwise.


4. KNOW YOUR OPERATORS.

Now, that statement may seem pedantic to some of you, but I'll 
bet that almost every person in this room has at least one operator he 
doesn't really understand, or seldom uses, or is simply scared 
of....mine happens to be dyadic transpose. APL has such a plethora of 
operators that people have gone lifetimes - well, not really - years 
with knowledge of only a limited subset of the language. Certainly a 
subset is adequate, but who knows what you could be missing? I 
suggest you take that operator and try everything under the sun with 
it. Al Rose and I, independently, did just that with the decode 
operator. Al came up with a one-line dyadic iota formulation for 
array left arguments, initially used to left- and right-justify 
character matrices. I came up with some interesting timings 
indicating just how fast base value is. In fact, for A←B←⍳1000,
B+1000×A is about the slowest way to combine them. Better is
1000 1+.×A,[.5]B, and best of all, my old nemesis 1000⊥A,[.5]B. Note
that if A is a scalar, instead of the previous 1000 multiplications, 
we only need to perform one before trundling on through the additions. 
Hence, I'll make the qualified suggestion to, 

5. USE THE SPECIALIZED OPERATOR. 

In the above case, base value did precisely what we wanted, with 
less potential riff-raff than inner product with its 441 permutations, 
or the generalized scalar operators. You might want to try timing for
yourself (where A←⍳1000): A×A vs. A*2.
As we might expect, the multiplication will be faster. After all,
exponentiation does a myriad of different operations, and is
constantly trying to decide which one.
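
The three ways of combining A and B in the base value example compute
the same numbers. The Python sketch below (hypothetical names, not APL)
models each one; the base value version is written as the usual
Horner-style evaluation down each column of the two-row laminate, which
is an assumption about the arithmetic only, not about how any APL
system implements it.

```python
def combine_direct(a, b):
    """Model of B+1000×A: a vector multiply, then a vector add."""
    return [y + 1000 * x for x, y in zip(a, b)]


def combine_inner_product(a, b, weights=(1000, 1)):
    """Model of 1000 1+.×A,[.5]B: weight each row, then sum by column."""
    return [weights[0] * x + weights[1] * y for x, y in zip(a, b)]


def combine_base_value(a, b, radix=1000):
    """Model of 1000⊥A,[.5]B: evaluate each column as radix-1000 digits."""
    results = []
    for column in zip(a, b):
        acc = 0
        for digit in column:
            acc = acc * radix + digit
        results.append(acc)
    return results
```

All three return the same vector; only the amount of general-purpose
machinery involved differs.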


I shall pass by my next point briefly: 

6. WATCH INTERNAL DATA CONVERSIONS 
because I hated DECLARE statements in PL/1, and dislike the whole 
messy business of the same number taking up 1, 32, or 64 bits in APL.
Suffice it to say that the internal representation of a number does 
make a significant difference in what internal APL must go through to 
produce a result. A common example is integer to real conversions, 
and the extra time required for floating point calculations. For 
example, given A←⍳1000 and B←3.5, why is B++/A so much faster than
+/A,B? Among other things, the fact that the catenation forces a
conversion of A to floating point, and the fact that floating point
addition takes longer than integer addition. But notice the drastic
reduction in time when no conversion is required (A←.5+⍳1000) even
though floating point additions are required; and the slight speedup
when only integer (A←⍳B←1000) additions are required.


It is appropriate to add two, rather undefined, suggestions: 

7. KNOW YOUR MACHINE. 

8. KNOW YOUR IMPLEMENTATION. 
Pay attention to speedups announced for your system; treasure any lore 
you pick up from systems programmers and other APL users - but don't
treasure it too long without trying it. Some commercial systems have
performed extensive surgery on the program product, and achieved
radical speedups that make possible what was previously impractical or
impossible in an APL workspace.
Some computers are designed to perform certain operations (yes, even
division) extremely fast. I present 3 comparisons done on a 360/91
and a 370/145. For A←?20 20⍴100, the 91 performs about 4 times faster
for integer inner product (A+.×A), and 43 times faster on dividing a
matrix by its inverse (A⌹A). These are hardware speedups, the
machine excelling in its own environment. An example of software
speedup is ×/20000⍴1 0 where the 145 operates about 56 times faster,
or somewhat over 309 times faster considering the relative speeds of
the two machines. The point should be sufficiently clear.


Before I branch into looping, I'll let you ponder the relative
speeds of (for B←50 100⍴'ABC'; A←5):

A⊖B   and   B[A⌽⍳(⍴B)[1];]
That the latter is twice as fast is a trifle disconcerting, and I 
can't explain it; indeed there are a few other constructs that can 
generally play havoc with primitive operators. But I don't want to 
frighten you....these cases are decidedly rare, and usually occur 
either because the internal code for the primitive was poorly written 
originally; or speedups to other operators have left a few in left
field for the time being. Consider the anomaly of a man wearing rags
and driving a Cadillac - he's either on his way up, or on his way
down.


I left the touchiest subject for last, because I'll probably 
receive the most harassment for it: 

9. DO NOT BE AFRAID TO (intelligently) LOOP. 
I have provided three paired examples for your perusal - each of which 
has a 'good', no-loop, practically one-line, advanced operator
solution, and a 'bad', looping, multi-line, simple operator solution.
Judge the merits for yourself. Note that the larger the problem, the
more heavily favored the looping version becomes; indeed, for very
small cases, the non-looping construct is somewhat faster. PBS2 in
the first test was written by John Heckman of I.P. Sharp Associates; I
share admiration of this delectable morsel of APL code with many
people. Note that it needs to loop only ⌈2⍟⍴N times to produce a
result, performing (⍴N)×⌈2⍟⍴N additions. The supposedly elegant PBS1
requires (⍴N)*2 additions, multiplications, and relational tests.
Sometimes
APL operators are just too powerful for their own good: 

10. DON'T OVERKILL A PROBLEM - BE SPECIFIC. 
Simplify the problem as much as possible. Analyse just exactly what 
you want to do. Generalized solutions may be beautiful to look at, 
but for real-life data processing, they're slow. Use of inner product 
for large string searches is an example of overkill. Be aware that 
the inner product is blind; it doesn't stop if it detects a 0 in a
column, but sashays down the entire column, ultimately requiring the
∧/ to get 0. FNDALL2 is not as muscular, but it is a great deal more
athletic. It is continually narrowing what it has to search, until it
either comes up empty-handed, or runs out of characters to
test....athletic and intelligent.
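
The narrowing strategy can be sketched as follows. This is not the
FNDALL2 listing, which is not reproduced in this section; it is a
Python model of the behavior described above, with a hypothetical
function name and 0-based positions. Each pass tests one character of
the pattern against the surviving candidate positions only, and the
loop stops as soon as no candidates remain.

```python
def find_all_narrowing(text, pattern):
    """Return every start position of pattern in text, shrinking the
    candidate list after each character of the pattern is tested."""
    if not pattern or len(pattern) > len(text):
        return []
    candidates = list(range(len(text) - len(pattern) + 1))
    for offset, ch in enumerate(pattern):
        # Keep only the positions that still match at this offset.
        candidates = [p for p in candidates if text[p + offset] == ch]
        if not candidates:
            break  # came up empty-handed: stop early
    return candidates
```

The loop runs at most once per pattern character, and each iteration
does one whole-vector comparison, which is the kind of looping the talk
is recommending.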


Finally, my ultimate revenge over APL non-loop purists, or how 15
lines of heavy looping APL code can run circles around that "elegant"
3-liner. ACCUM is modular, for easy reference; it loops quickly, does
as little work as possible, takes advantage of known speedups, and
leaves OLDWAY behind in the dust. In case you are wondering, all
three of the looping functions can handle at least four times as large
arguments in a workspace without engendering WS FULLS, and in some
cases can handle ten or more times as much data.


11. SAVE SPACE (and time) BY LOOPING. 
Of course, APL\360 is interpretive, and FORTRAN-type looping in APL 
would be disastrous; to be efficient, a looping program must: 1) set 
up as much as possible before the loop is entered; 2) do as much as 
possible with each iteration; and 3) waste as little time as possible 
incrementing counters, testing, and branching - in fact, not bad 
advice for any language. I've enclosed a looping test to give you a
rough idea of the relative speeds of different types of conditional
branches. Discrepancies of two or three 60ths of a second are not
significant. LOOP uses constants; LOOPY uses labels as the target
points.


In conclusion, a little philosophy and a few qualifications. A 
program actually has 3 levels: 1) the overall process being 
performed; 2) the external steps and operations involved - APL
operators if you will; and 3) the internal workings of the machine in 
executing that APL code. APL allows you to ignore the last level; it 
simplifies and expedites the intermediate level; and ultimately allows 
you to concentrate on the real problem at hand. I advise you to get 
solutions first, then optimize for speed if you think it is necessary. 
The prelude to efficiency must be effectiveness. My whole talk has 
been based on the premise that APL is already solving your problems, 
but you want it to be a little less sluggish about it. Finally, don't 
take my word for anything - we may operate on different machines, 
different operating systems, different premises - I may be 
misinformed. Sit down at your 2741 and test, benchmark, experiment -
then write me a letter telling me where I went wrong.


APPENDIX 


The functions used as examples in this paper are 
listed in this appendix along with timings made on 
the APL*PLUS system in March 1973 running an IBM 
370 Model 145. 


Changes are anticipated in both the computer model
and the APL*PLUS software which will speed up execution
of functions. To obtain these functions for execution
with the current APL*PLUS hardware and software: